Abstract

The additive model is one of the most popular semiparametric models. The backfitting
estimation (Buja, Hastie and Tibshirani, 1989, Ann. Statist.) for the model is intuitively
easy to understand and theoretically most efficient (Opsomer and Ruppert, 1997, Ann.
Statist.); its implementation is equivalent to solving simple linear equations. However,
convergence of the algorithm is very difficult to investigate and is still unsolved. For
bivariate additive models, Opsomer and Ruppert (1997, Ann. Statist.) proved the convergence
under a very strong condition and conjectured that a much weaker condition is sufficient.
In this short note, we show that a weak condition can guarantee the convergence of the
backfitting estimation algorithm when the Nadaraya-Watson kernel smoothing is used.

Key words: additive model; backfitting algorithm; convergence of algorithm; kernel 
smoothing. 



1 Introduction 

The additive model has proved to be a very useful semiparametric model and is widely
used in practice. An intuitive implementation of the estimation is the backfitting
approach (Buja, Hastie and Tibshirani, 1989, called BHT hereafter). The implementation
can be done easily by solving linear normal equations (p. 476, BHT) if the backfitting
algorithm converges. However, justifying the convergence of the algorithm is not easy.
BHT provided sufficient conditions that guarantee the convergence of the backfitting
algorithm or, equivalently, the existence of the estimators. These conditions are
generally satisfied by regression splines and some other methods, but not by kernel
smoothing. Some other approaches (e.g. Tjøstheim and Auestad, 1994; Linton and Nielsen,
1995; Mammen, Linton and Nielsen, 1999; Wang and Yang, 2007) have been proposed to avoid
the hard problems concerning the convergence of the algorithm and the asymptotics of the
estimators. However, the original backfitting of BHT is still one of the most intuitive
approaches.

Opsomer and Ruppert (1997, called OR hereafter) investigated the algorithm's convergence
for the local polynomial kernel smoothing when the predictors are bivariate. Suppose
$Y$ is the response and $(U, V)$ is the bivariate predictor satisfying the additive model

$$Y = \alpha + m_1(U) + m_2(V) + \varepsilon, \tag{1}$$

where $E(\varepsilon \mid U, V) = 0$ almost surely. Constraints $E\{m_1(U)\} = E\{m_2(V)\} = 0$
are usually imposed for model identification; see for example OR. It is known (see, e.g.
BHT) that the terms in the model are the solution to

$$\min_{m_1 \in L_2,\; m_2 \in L_2} E\{Y - \alpha - m_1(U) - m_2(V)\}^2, \tag{2}$$

where $L_2$ is the space of measurable functions with finite second moments. Let $f(u,v)$,
$f_1(u)$ and $f_2(v)$ be the joint density function and marginal density functions of
$(U, V)$, $U$ and $V$ respectively. OR required that

$$\sup_{u,v} \left| \frac{f(u,v)}{f_1(u) f_2(v)} - 1 \right| < 1$$

to prove the convergence of the backfitting algorithm. This requirement is very stringent
and even excludes a large part of the normal distributions. However, OR conjectured that
the convergence of the algorithm can be guaranteed under very weak conditions. Next, we
shall prove that their conjecture is correct when the Nadaraya-Watson kernel smoothing is
used.



2 Main results 



Suppose $\{(Y_i, U_i, V_i) : i = 1, \ldots, n\}$ is a random sample from model (1). Following
BHT, let $\mathbf{m}_1 = (m_1(U_1), \ldots, m_1(U_n))^T$, $\mathbf{m}_2 = (m_2(V_1), \ldots, m_2(V_n))^T$
and $\mathbf{Y} = (Y_1, \ldots, Y_n)^T$. The estimators of functions $m_1$ and $m_2$ are
determined by the estimation of function values at the observed points, i.e. $\mathbf{m}_1$
and $\mathbf{m}_2$. Let $K(\cdot) \ge 0$ be a kernel function and $K_h(\cdot) = K(\cdot/h)/h$
for any $h > 0$.

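For the estimation of function values at $U_i$ and $V_i$, we use (varying) bandwidths
$h_i > 0$ and $\hbar_i > 0$ respectively and kernel weights

$$s_{1,i} = \frac{[K_{h_i}(U_i - U_1), \ldots, K_{h_i}(U_i - U_n)]^T}{\sum_{k=1}^{n} K_{h_i}(U_i - U_k)}
\quad\text{and}\quad
s_{2,i} = \frac{[K_{\hbar_i}(V_i - V_1), \ldots, K_{\hbar_i}(V_i - V_n)]^T}{\sum_{k=1}^{n} K_{\hbar_i}(V_i - V_k)}.$$

Let

$$S_1 = \begin{pmatrix} s_{1,1}^T \\ \vdots \\ s_{1,n}^T \end{pmatrix},
\qquad
S_2 = \begin{pmatrix} s_{2,1}^T \\ \vdots \\ s_{2,n}^T \end{pmatrix}.$$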



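Corresponding to constraints $E\{m_1(U)\} = E\{m_2(V)\} = 0$, we introduce
$(I_n - \mathbf{1}_n\mathbf{1}_n^T/n)$, where $I_n$ is the $n \times n$ identity matrix and
$\mathbf{1}_n$ is an $n \times 1$ vector with all entries 1. Let
$S_1^* = (I_n - \mathbf{1}_n\mathbf{1}_n^T/n)S_1$ and $S_2^* = (I_n - \mathbf{1}_n\mathbf{1}_n^T/n)S_2$.
Using kernel smoothing, the backfitting estimation procedure iterates

$$\mathbf{m}_1^{new} := S_1^*\{\mathbf{Y} - \mathbf{m}_2^{old}\}, \qquad
\mathbf{m}_2^{new} := S_2^*\{\mathbf{Y} - \mathbf{m}_1^{old}\}.$$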
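The iteration above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the Epanechnikov kernel, the constant bandwidths, the simulated data and the function names (`epanechnikov`, `centered_smoother`, `backfit`) are all choices made here, not taken from the paper.

```python
import numpy as np


def epanechnikov(v):
    """Symmetric kernel (an assumed choice): positive for |v| < 1, zero otherwise."""
    return np.where(np.abs(v) < 1.0, 0.75 * (1.0 - v**2), 0.0)


def centered_smoother(x, h):
    """Centered Nadaraya-Watson smoother (I - 11'/n) S for a constant bandwidth h."""
    n = len(x)
    weights = epanechnikov((x[:, None] - x[None, :]) / h) / h  # K_h(x_i - x_k)
    smoother = weights / weights.sum(axis=1, keepdims=True)    # each row sums to 1
    return (np.eye(n) - np.ones((n, n)) / n) @ smoother


def backfit(y, u, v, h_u, h_v, max_iter=1000, tol=1e-10):
    """Iterate m1 <- S1*(y - m2_old), m2 <- S2*(y - m1_old) until updates stall."""
    s1, s2 = centered_smoother(u, h_u), centered_smoother(v, h_v)
    m1, m2 = np.zeros(len(y)), np.zeros(len(y))
    for iteration in range(1, max_iter + 1):
        m1_new, m2_new = s1 @ (y - m2), s2 @ (y - m1)   # both use the old values
        change = max(np.max(np.abs(m1_new - m1)), np.max(np.abs(m2_new - m2)))
        m1, m2 = m1_new, m2_new
        if change < tol:
            return y.mean(), m1, m2, iteration
    raise RuntimeError("backfitting did not converge within max_iter iterations")


# Hypothetical simulated data: independent uniform predictors, additive signal.
rng = np.random.default_rng(0)
n = 80
u, v = rng.uniform(size=n), rng.uniform(size=n)
y = 1.0 + np.sin(2 * np.pi * u) + (v**2 - 1.0 / 3.0) + 0.1 * rng.normal(size=n)
alpha_hat, m1_hat, m2_hat, n_iter = backfit(y, u, v, h_u=0.3, h_v=0.3)
```

Because $K(0) > 0$, every row of the weight matrix has a positive diagonal entry, so the row normalisation never divides by zero.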

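As BHT pointed out, the final estimators $\hat{\mathbf{m}}_1$ and $\hat{\mathbf{m}}_2$ of the
algorithm are equivalent to the solution of

$$\begin{pmatrix} I_n & S_1^* \\ S_2^* & I_n \end{pmatrix}
\begin{pmatrix} \hat{\mathbf{m}}_1 \\ \hat{\mathbf{m}}_2 \end{pmatrix}
= \begin{pmatrix} S_1^* \\ S_2^* \end{pmatrix} \mathbf{Y}.$$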

The solution exists if the inverse of $(I_n - S_1^* S_2^*)$ or $(I_n - S_2^* S_1^*)$ exists.
If the iteration converges, then the estimators of $\alpha$, $\mathbf{m}_1$ and $\mathbf{m}_2$
are respectively $\hat\alpha = \bar{Y}$,

$$\hat{\mathbf{m}}_1 = S_1^*(I_n - S_2^* S_1^*)^{-1}(I_n - S_2^*)\mathbf{Y}$$

and

$$\hat{\mathbf{m}}_2 = (I_n - S_2^* S_1^*)^{-1} S_2^*(I_n - S_1^*)\mathbf{Y}$$

(the solutions can be rewritten in different forms). As we can see, the backfitting
estimation is very easy to implement and is equivalent to a one-step calculation, if it
converges. Thus, convergence of the algorithm is essential for the estimation of the
additive model.
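As a rough illustration of the one-step calculation, the closed-form expressions can be evaluated with a linear solve instead of an explicit inverse. The Gaussian kernel, the bandwidth 0.2, the simulated data and the names `centered_gaussian_smoother` and `closed_form_fit` are assumptions made for this sketch only.

```python
import numpy as np


def centered_gaussian_smoother(x, h):
    """Centered Nadaraya-Watson smoother (I - 11'/n) S with a Gaussian kernel.

    The kernel's normalising constant and the 1/h factor cancel in the ratio.
    """
    n = len(x)
    weights = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    smoother = weights / weights.sum(axis=1, keepdims=True)
    return (np.eye(n) - np.ones((n, n)) / n) @ smoother


def closed_form_fit(y, s1, s2):
    """Evaluate the closed-form estimators; s1, s2 are the centered smoothers."""
    eye = np.eye(len(y))
    a = eye - s2 @ s1                                 # I - S2* S1*
    m1 = s1 @ np.linalg.solve(a, (eye - s2) @ y)      # S1* A^{-1} (I - S2*) Y
    m2 = np.linalg.solve(a, s2 @ ((eye - s1) @ y))    # A^{-1} S2* (I - S1*) Y
    return y.mean(), m1, m2


# Hypothetical simulated data.
rng = np.random.default_rng(1)
n = 60
u, v = rng.uniform(size=n), rng.uniform(size=n)
y = np.cos(3 * u) + v + 0.1 * rng.normal(size=n)
s1, s2 = centered_gaussian_smoother(u, 0.2), centered_gaussian_smoother(v, 0.2)
alpha_hat, m1_hat, m2_hat = closed_form_fit(y, s1, s2)
```

The two returned vectors should satisfy the fixed-point relations of the iteration, which gives a simple numerical check that the formulas and the iterative scheme describe the same estimators.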

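Theorem 1 Denote the order statistics of $\{U_1, \ldots, U_n\}$ and $\{V_1, \ldots, V_n\}$ by
$\{U_{[1]}, \ldots, U_{[n]}\}$ and $\{V_{[1]}, \ldots, V_{[n]}\}$ respectively, and their
corresponding bandwidths by $\{h_{[1]}, \ldots, h_{[n]}\}$ and $\{\hbar_{[1]}, \ldots, \hbar_{[n]}\}$
respectively. If the kernel function $K(\cdot)$ and the bandwidths satisfy $K(0) > 0$,

$$\begin{aligned}
K_{h_{[i]}}(U_{[i]} - U_{[i-1]}) > 0, &\quad K_{h_{[i]}}(U_{[i]} - U_{[i+1]}) > 0, \\
K_{\hbar_{[i]}}(V_{[i]} - V_{[i-1]}) > 0, &\quad K_{\hbar_{[i]}}(V_{[i]} - V_{[i+1]}) > 0,
\end{aligned} \tag{3}$$

for $1 < i < n$, and

$$\begin{aligned}
K_{h_{[1]}}(U_{[1]} - U_{[2]}) > 0, &\quad K_{h_{[n]}}(U_{[n]} - U_{[n-1]}) > 0, \\
K_{\hbar_{[1]}}(V_{[1]} - V_{[2]}) > 0, &\quad K_{\hbar_{[n]}}(V_{[n]} - V_{[n-1]}) > 0,
\end{aligned} \tag{4}$$

then the backfitting algorithm converges.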

Remark 1 Suppose $K(\cdot)$ is a symmetric kernel function with $K(v) > 0$ for all $|v| < 1$
and that global (constant) bandwidths $h$ and $\hbar$ are used. If $h$ and $\hbar$ are bigger
than the largest difference between any two nearest points respectively, i.e.

$$h > \max\{U_{[i+1]} - U_{[i]}, i = 1, \ldots, n-1\} \quad\text{and}\quad
\hbar > \max\{V_{[i+1]} - V_{[i]}, i = 1, \ldots, n-1\}, \tag{5}$$

then (3) and (4) hold. By Theorem 1 the convergence of the algorithm is guaranteed.
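In practice the condition in Remark 1 is easy to check from the data. The following is a minimal sketch; the function names and the toy arrays are hypothetical, and the Epanechnikov kernel is just one example of a symmetric kernel that is positive on $|v| < 1$.

```python
import numpy as np


def max_adjacent_gap(x):
    """Largest difference between neighbouring order statistics of x."""
    return float(np.max(np.diff(np.sort(x))))


def condition_5_holds(u, v, h_u, h_v):
    """Check the sufficient condition (5) for constant bandwidths h_u and h_v."""
    return h_u > max_adjacent_gap(u) and h_v > max_adjacent_gap(v)


def neighbour_weights_positive(x, h, kernel):
    """Check (3)-(4) for one predictor with a constant bandwidth and a symmetric
    kernel: K(0) > 0 and a positive weight between adjacent order statistics."""
    gaps = np.diff(np.sort(x))
    return bool(kernel(0.0) > 0 and np.all(kernel(gaps / h) > 0))


def epanechnikov(v):
    """Example kernel: symmetric, positive for |v| < 1, zero otherwise."""
    return np.where(np.abs(v) < 1.0, 0.75 * (1.0 - np.square(v)), 0.0)


# Toy predictors whose largest adjacent gap is about 0.3 in both coordinates.
u = np.array([0.50, 0.00, 0.40, 0.10])
v = np.array([0.20, 0.90, 0.60, 0.35])
print(condition_5_holds(u, v, 0.35, 0.35))   # bandwidths above both gaps
print(condition_5_holds(u, v, 0.35, 0.25))   # second bandwidth too small
```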

Corollary 1 Suppose $U$ and $V$ are distributed on two compact intervals respectively with
density functions bounded away from 0. If global (constant) bandwidths $h$ and $\hbar$ are
used with $h, \hbar \to 0$ and $nh/\log(n), n\hbar/\log(n) \to \infty$, then the algorithm
converges with probability tending to 1 as $n \to \infty$.

Remark 2 It is worth noting that the range of bandwidths for the algorithm to converge
is quite wide, and that bandwidths $h \propto n^{-\delta}$ and $\hbar \propto n^{-\delta}$ with
$0 < \delta < 1$ satisfy the requirement in Corollary 1. Thus, the algorithm converges. These
bandwidths include the optimal bandwidths where $\delta = 1/5$ (see, e.g. OR).



This short note only considers the bivariate case with Nadaraya-Watson kernel smoothing.
We conjecture that the backfitting estimation still converges under weak conditions for
general additive models and other kernel estimation methods including the local polynomial
smoothing. After the convergence is justified, asymptotics of the estimators can be
obtained following exactly the same arguments as in Opsomer and Ruppert (1997). The
details are omitted.

3 Proofs 

The proof of Theorem 1 is based on the properties of the regular Markov chain and the
Perron-Frobenius theorem (see, e.g. Minc, 1988). The proof of Corollary 1 is based on the
properties of order statistics (see, e.g. David and Nagaraja, 2003).

Proof of Theorem 1. We first prove that the eigenvalues of $S_1$ are all smaller than 1 in
absolute value, with the only exception of one eigenvalue that equals 1. It is easy to see
that $S_1$ is the transition probability matrix of a Markov chain. By conditions (3) and
(4), $S_1$ is irreducible and aperiodic. Therefore it is a regular transition probability
matrix. There is an integer $k$ such that all entries in $S_1^k$ are strictly positive (see,
e.g. Romanovsky, 1970, Theorem 14.1). By the Perron-Frobenius theorem, there is one (and
only one) eigenvalue $\lambda_1$ of multiplicity 1 such that all entries in its corresponding
eigenvector are positive. It is easy to see that this eigenvalue is $\lambda_1 = 1$ and its
eigenvector is $\theta = \mathbf{1}_n/\sqrt{n}$, because the sum of any row in $S_1$ is 1. Let
$\lambda_2, \ldots, \lambda_n$ be the other $n - 1$ eigenvalues of $S_1$ (repeated eigenvalues
are counted repeatedly). The Perron-Frobenius theorem also indicates that
$1 = \lambda_1 > \max\{|\lambda_2|, \ldots, |\lambda_n|\}$.

Next, we show that the eigenvalues of $S_1^* = (I_n - \theta\theta^T)S_1$ are all strictly
smaller than 1 in absolute value. Suppose that the eigenvalues $\lambda_2, \ldots, \lambda_n$
of $S_1$ are distinct and their corresponding eigenvectors are $\beta_2, \ldots, \beta_n$
respectively (the general argument is similar, but needs more complicated notation). It is
easy to check that $\theta$ and $(I_n - \theta\theta^T)\beta_k$, $k = 2, \ldots, n$, are the
eigenvectors of $S_1^*$ with corresponding eigenvalues being 0 and $\lambda_2, \ldots, \lambda_n$
respectively, because

$$(I_n - \theta\theta^T) S_1 \theta = (I_n - \theta\theta^T)\lambda_1\theta = 0$$

and

$$\begin{aligned}
(I_n - \theta\theta^T) S_1 (I_n - \theta\theta^T)\beta_k
 &= (I_n - \theta\theta^T)\{S_1\beta_k - S_1\theta\theta^T\beta_k\} \\
 &= (I_n - \theta\theta^T)\{\lambda_k\beta_k - \lambda_1\theta\theta^T\beta_k\} \\
 &= \lambda_k (I_n - \theta\theta^T)\beta_k - \lambda_1 (I_n - \theta\theta^T)\theta\theta^T\beta_k \\
 &= \lambda_k (I_n - \theta\theta^T)\beta_k, \qquad \text{for } k = 2, \ldots, n.
\end{aligned}$$

Since the absolute values of $0, \lambda_2, \ldots, \lambda_n$ are all smaller than 1, we have
proved that the eigenvalues of $S_1^*$ are smaller than 1 in absolute value. Applying the
same argument to $S_2$, the absolute values of all eigenvalues of $S_2^*$ are also smaller
than 1.

Since the largest absolute eigenvalues of both $S_1^*$ and $S_2^*$ are smaller than 1, the
absolute values of all eigenvalues of $S_1^* S_2^*$ and $S_2^* S_1^*$ are also smaller than 1.
It follows that the inverses of $(I_n - S_1^* S_2^*)$ and $(I_n - S_2^* S_1^*)$ exist, and thus
the algorithm converges. □
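The spectral claims in the proof can be inspected numerically on a simulated sample. This is a sketch under assumptions made here (Epanechnikov kernel, constant bandwidth 0.3, independent uniform predictors, helper names `nw_smoother` and `sorted_moduli`); it illustrates the statements on one data set and does not replace the argument.

```python
import numpy as np


def nw_smoother(x, h):
    """Row-stochastic Nadaraya-Watson smoother with the Epanechnikov kernel."""
    t = (x[:, None] - x[None, :]) / h
    weights = np.where(np.abs(t) < 1.0, 0.75 * (1.0 - t**2), 0.0)
    return weights / weights.sum(axis=1, keepdims=True)


def sorted_moduli(matrix):
    """Absolute values of the eigenvalues, largest first."""
    return np.sort(np.abs(np.linalg.eigvals(matrix)))[::-1]


# Hypothetical sample; h is meant to exceed the largest adjacent gap, as in (5).
rng = np.random.default_rng(2)
n = 50
u, v = rng.uniform(size=n), rng.uniform(size=n)
h = 0.3
gap_ok = bool(h > max(np.max(np.diff(np.sort(u))), np.max(np.diff(np.sort(v)))))

center = np.eye(n) - np.ones((n, n)) / n      # I - theta theta'
s1, s2 = nw_smoother(u, h), nw_smoother(v, h)
moduli_s1 = sorted_moduli(s1)                 # leading modulus 1, the rest below 1
radius_s1_star = sorted_moduli(center @ s1)[0]
radius_product = sorted_moduli(center @ s1 @ center @ s2)[0]
print(gap_ok, moduli_s1[:2], radius_s1_star, radius_product)
```

Centering removes the unit eigenvalue of the smoother, so the spectral radius of the centered matrix is the second-largest modulus of the original one.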

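Proof of Corollary 1. It is easy to check

$$\begin{aligned}
& P(h > \max\{U_{[i+1]} - U_{[i]}, i = 1, \ldots, n-1\},\ \hbar > \max\{V_{[i+1]} - V_{[i]}, i = 1, \ldots, n-1\}) \\
& \quad \ge 1 - P(h < \max\{U_{[i+1]} - U_{[i]}, i = 1, \ldots, n-1\}) \\
& \qquad - P(\hbar < \max\{V_{[i+1]} - V_{[i]}, i = 1, \ldots, n-1\}).
\end{aligned} \tag{6}$$

Consider the second term above. We have

$$P(h < \max\{U_{[i+1]} - U_{[i]}, i = 1, \ldots, n-1\}) \le \sum_{i=1}^{n-1} P(h < U_{[i+1]} - U_{[i]}). \tag{7}$$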

Let $F$ be the cumulative distribution function of $U$. Then $U' = F(U)$ is uniformly
distributed on $[0, 1]$. Let $U'_{[i]} = F(U_{[i]})$. By the joint distribution of
$(U'_{[i]}, U'_{[i+1]})$ (see, e.g. David and Nagaraja, 2003) and simple calculation, we have
for any $c > 0$

$$P(c < U'_{[i+1]} - U'_{[i]})
= \iint_{v - u > c} \frac{n!}{(i-1)!\,(n-i-1)!}\, u^{i-1}(1-v)^{n-i-1}\, du\, dv
= \begin{cases} (1-c)^n, & \text{if } 0 < c < 1, \\ 0, & \text{if } c \ge 1. \end{cases}$$
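The "simple calculation" can be spelled out as follows. The sketch assumes the standard joint density $\frac{n!}{(i-1)!(n-i-1)!}u^{i-1}(1-v)^{n-i-1}$, $0 < u < v < 1$, of two consecutive order statistics from a Uniform(0,1) sample of size $n$; the intermediate steps are a reconstruction and are not taken from the original.

```latex
\begin{aligned}
P\bigl(c < U'_{[i+1]} - U'_{[i]}\bigr)
 &= \frac{n!}{(i-1)!\,(n-i-1)!}\int_0^{1-c} u^{i-1}\int_{u+c}^{1}(1-v)^{n-i-1}\,dv\,du \\
 &= \frac{n!}{(i-1)!\,(n-i)!}\int_0^{1-c} u^{i-1}(1-c-u)^{n-i}\,du \\
 &= \frac{n!}{(i-1)!\,(n-i)!}\,(1-c)^{n}\,\frac{(i-1)!\,(n-i)!}{n!}
  = (1-c)^{n}, \qquad 0 < c < 1 .
\end{aligned}
```

The last integral is evaluated with the substitution $u = (1-c)t$, which turns it into $(1-c)^n$ times the Beta integral $B(i, n-i+1)$. The result does not depend on $i$.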

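Let $c_0 = \inf\{f_1(u)\}$, where the infimum is taken over the support of $U$; it is
positive by the assumption. Note that $U_{[i]} = G(U'_{[i]})$, where $G$ is the inverse
function of $F$. Since $G'(u) = 1/f_1(G(u)) \le c_0^{-1}$, we have
$U_{[i+1]} - U_{[i]} \le c_0^{-1}(U'_{[i+1]} - U'_{[i]})$. Thus

$$P(c_0^{-1} c < U_{[i+1]} - U_{[i]}) \le P(c < U'_{[i+1]} - U'_{[i]})
= \begin{cases} (1-c)^n, & \text{if } 0 < c < 1, \\ 0, & \text{if } c \ge 1. \end{cases}$$

When $n$ is large, we can assume $c_0 h < 1$. Taking $c = c_0 h$, it follows that

$$\begin{aligned}
\sum_{i=1}^{n-1} P(h < U_{[i+1]} - U_{[i]})
 &\le n(1 - c_0 h)^{n-1} = n\exp\{(n-1)\log(1 - c_0 h)\} \\
 &\le n\exp\{(n-1)(-c_0 h + c_0^2 h^2/2)\} \le n\exp\{-(n-1)c_0 h/2\} \to 0
\end{aligned} \tag{8}$$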



as $n \to \infty$. Condition $nh/\log(n) \to \infty$ is used in the last step of (8). By (7)
and (8), we have

$$P(h < \max\{U_{[i+1]} - U_{[i]}, i = 1, \ldots, n-1\}) \to 0$$

as $n \to \infty$. Similarly, we can show that

$$P(\hbar < \max\{V_{[i+1]} - V_{[i]}, i = 1, \ldots, n-1\}) \to 0$$

as $n \to \infty$. It follows from (6) and the two equations above that

$$P(h > \max\{U_{[i+1]} - U_{[i]}, i = 1, \ldots, n-1\},\ \hbar > \max\{V_{[i+1]} - V_{[i]}, i = 1, \ldots, n-1\}) \to 1$$

as $n \to \infty$. By Remark 1 and (5), the algorithm converges with probability tending to 1
as $n \to \infty$. □
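To get a feel for how quickly the probability in the proof vanishes, one can compare the final bound in (8) with a Monte Carlo estimate. The sketch assumes $U$ is uniform on $[0, 1]$, so that the constant $c_0$ equals 1, and uses the rate $h = n^{-1/5}$ from Remark 2; the function names and the number of replications are choices made here.

```python
import numpy as np


def gap_exceedance_bound(n, h):
    """Upper bound n * exp{-(n - 1) h / 2} from (8), taking c_0 = 1 (uniform U)."""
    return n * np.exp(-(n - 1) * h / 2.0)


def simulated_gap_exceedance(n, h, n_rep=2000, seed=0):
    """Monte Carlo estimate of P(h < largest adjacent gap) for U ~ Uniform(0, 1)."""
    rng = np.random.default_rng(seed)
    samples = np.sort(rng.uniform(size=(n_rep, n)), axis=1)
    return float(np.mean(np.diff(samples, axis=1).max(axis=1) > h))


for n in (50, 200, 800):
    h = n ** (-1.0 / 5.0)   # bandwidth rate mentioned in Remark 2
    print(n, round(h, 3), simulated_gap_exceedance(n, h), gap_exceedance_bound(n, h))
```

For small samples and small bandwidths the bound can exceed 1 and is then uninformative; it becomes useful once $(n-1)h/2$ is large relative to $\log n$.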

Acknowledgements: The author thanks an associate editor, a referee and Professor Z. D. 
Bai for their very valuable comments. The research was partially supported by the National 
Natural Science Foundation of China (Grant no. 10471061). 

References 

Buja, A., Hastie, T. and Tibshirani, R. (1989). Linear smoothers and additive models 
(with discussion). Ann. Statist. 17 453-555. 

David, H. A. and Nagaraja, H. N. (2003). Order Statistics. Wiley, New Jersey. 






Linton, O. and Nielsen, J. P. (1995). A kernel method of estimating structured nonpara- 
metric regression based on marginal integration. Biometrika 82 93-100. 

Mammen, E., Linton, O. and Nielsen, J. P. (1999). The existence and asymptotic prop- 
erties of a backfitting projection algorithm under weak conditions. Ann. Statist. 27 
1443-1490. 

Minc, H. (1988). Nonnegative Matrices. Wiley, New York.

Opsomer, J. D. and Ruppert, D. (1997). Fitting a bivariate additive model by local
polynomial regression. Ann. Statist. 25 186-211.

Romanovsky, V. I. (1970). Discrete Markov Chains. Wolters-Noordhoff Publishing, Gronin- 
gen, Netherlands. 

Tjøstheim, D. and Auestad, B. (1994). Nonparametric identification of nonlinear time
series: Projections. J. Amer. Statist. Assoc. 89 1398-1409. 

Wang, L. and Yang, L. (2007). Spline-backfitted kernel smoothing of nonlinear additive 
autoregression model. Ann. Statist. 35 2474-2503.






